Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add method to query HW size #630

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

jeffhammond
Copy link
Member

@jeffhammond jeffhammond commented May 3, 2022

This is my first step towards addressing the oversubscription encountered in #588.

This adds a utility to figure out how many CPUs (hardware threads) are available in the affinity mask of the calling context. This is a stronger restriction than the number of available software threads.

@egaudry

Signed-off-by: Jeff Hammond jehammond@nvidia.com

Open questions (not all need to be addressed in this PR):

  • 1) What about windows?
  • 2) What should happen when called in OMP threaded region?
    • Respect nested OMP settings (e.g. omp_get_max_threads())?
    • Single-threaded?
    • Configurable?
  • 3) What should happen when n_threads > n_hw_cores?
  • 4) When maintaining a pthreads thread pool (also TODO), also use CPU mask from user?
    • No easy way to know when we're in a user parallel region. Should the user be responsible for setting a CPU mask?
  • 5) What if the user doesn't set the CPU mask appropriately when running multiple user threads? (Could potentially be addressed in pthreads mode)
    • Threads for all HW cores could be created and pinned at initialization and "checked out" when needed.

@jeffhammond jeffhammond requested a review from devinamatthews May 3, 2022 06:13
@jeffhammond jeffhammond self-assigned this May 3, 2022
Signed-off-by: Jeff Hammond <jehammond@nvidia.com>
@jeffhammond jeffhammond force-pushed the affinity_mask_size branch from 19de193 to 8bd9160 Compare May 3, 2022 06:38
Signed-off-by: Jeff Hammond <jeff.science@gmail.com>
this is gross but i need #define GNU_SOURCE before any other headers, or the CPU_ISSET macro isn't defined.

Signed-off-by: Jeff Hammond <jeff.science@gmail.com>
Signed-off-by: Jeff Hammond <jeff.science@gmail.com>
Signed-off-by: Jeff Hammond <jeff.science@gmail.com>
Signed-off-by: Jeff Hammond <jeff.science@gmail.com>
Signed-off-by: Jeff Hammond <jeff.science@gmail.com>
@devinamatthews
Copy link
Member

Sorry @jeffhammond I hijacked the PR description to flesh out some ideas related to the larger problem of thread pinning.

@jeffhammond
Copy link
Member Author

To be clear, final version will not abort. But I could not figure out how to set BLIS threading variables correctly. That's why there's a preprocessing warning saying "please help me".

@devinamatthews
Copy link
Member

@jeffhammond once we're in the OMP region, it's kind of too late to do anything except, as in bli_l3_thread_decorator_thread_check, detect that we have only one thread when we expected more (e.g. if nested OMP is disabled). I think this check would have to come before the OMP region and initialization of the global communicator.

@devinamatthews
Copy link
Member

@jeffhammond can you walk me through the logic of the PR as implemented? I'm not sure how the CPU mask alone can reliably be used to detect oversubscription.

@jeffhammond
Copy link
Member Author

I agree. We can do slightly better than serialization if nested is enabled but I don't think that's an important thing to spend time on.

@@ -199,6 +200,12 @@ void bli_l3_thread_decorator_thread_check
)
{
dim_t n_threads_real = omp_get_num_threads();
dim_t n_threads_hwmask;
if ( omp_in_parallel() ) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@devinamatthews here, we look for the number of HW threads available to the thread if in OpenMP already, or the process, if not in OpenMP already, and if the user is trying to use more SW threads than HW threads, we abort.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but this requires the user to have set the CPU mask (or OMP affinity?) correctly. What if I just run BLIS in OpenMP (with nesting) and set both OMP_NUM_THREADS and BLIS_NUM_THREADS equal to the number of cores? Or is this only meant to guard against a more specific use case?

bli_thrcomm_init( n_threads_hwmask, gl_comm );
bli_rntm_set_num_threads_only( n_threads_hwmask, rntm );
#warning HELP ME HERE
bli_rntm_set_ways_only( 1, 1, 1, 1, 1, rntm );
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@devinamatthews this is the part where i need help. i can just do serialization, but i was thinking about trying to do slightly better, e.g. if i have 40 cores and the user wants to run on 80 threads, i can set things up to use 40 threads properly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reducing the number of active threads after thread creation is possible, but not easy. If this logic can go before the OpenMP region then things are much easier.

@jeffhammond
Copy link
Member Author

the affinity mask HW thread count is compared to the user-specified SW thread count.

@devinamatthews
Copy link
Member

Ah OK, so the user could in principle reduce the number of HW cores "visible" to BLIS by setting a mask but if no action is taken then BLIS just sees all cores. OK.

@devinamatthews
Copy link
Member

@jeffhammond sorry to be dense but I still don't see how this check fixes the oversubscription problem in #588. Each "instance" of BLIS would still pass the check, but the total number of threads is more than the number of cores. Detecting nested OMP and dropping down to one core could work, but would have to be configurable. Otherwise, maybe we could check omp_get_max_threads?

@jeffhammond
Copy link
Member Author

I've asked @egaudry to test with his application, which may be settings affinity masks via MPI or HWLOC.

But I can say that it works to detect oversubscription for me. If I set OMP_NUM_THREADS=16 on an Intel CPU with 8 HW threads, it aborts. When I set OMP_NUM_THREADS=8, it's fine. I added a test that can do other things.

@devinamatthews
Copy link
Member

If I set OMP_NUM_THREADS=16 on an Intel CPU with 8 HW threads, it aborts. When I set OMP_NUM_THREADS=8, it's fine. I added a test that can do other things.

Yes that should work fine, but setting OMP_NUM_THREADS=BLIS_NUM_THREADS=4 and nesting should not abort and yet oversubscribe. Isn't this the more pressing problem?

@devinamatthews
Copy link
Member

Here are my suggested solutions:

  1. Maintain a process-wide atomic int which counts the number of active threads. Abort (or warn etc.) if this exceeds the number of HW cores.
  2. Prepare a set of tokens, one for each HW core. Threads check out these tokens when created. Abort if no tokens are available. This approach can be extended to address thread pinning since each token can indicate a particluar physical core.

@devinamatthews
Copy link
Member

Also, I suggest an environment variable BLIS_OVERSUBSCRIBE. The values ABORT and WARN would do what you expect and any other value ignores oversubscription. Maybe also a SERIALIZE option?

@jeffhammond
Copy link
Member Author

The token idea doesn't work if BLIS runs alongside other things that use threads. But I'll see what I can do.

I disagree that nesting is the higher priority. The bug that motivated all of this was not because of nested OpenMP.

@devinamatthews
Copy link
Member

@egaudry @jeffhammond the issue in #588 was explained here. How is that not nested OMP?

@jeffhammond
Copy link
Member Author

Okay it was nested open but also affinity. I can fix them both.

@egaudry
Copy link

egaudry commented May 10, 2022

Sorry, I haven't had the time to check on my side.

@jeffhammond is correct that the issue was seen because an affinity mask was set
@devinamatthews is also correct that the issue originated from the oversubscription

to clarify, the slowdown occurred because more thread than available physical cores where used in BLIS within a region protected with an openmp lock.
this could happen as well with oversubscription in case of a nested omp loop, with more thread being used than physically available cores.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants